CSCI E83 – Fall Term 2023- Fundamental of Data Science¶

Graduate Credit Project – Asma Naeem: Time Series Demand Forecasting¶

Executive Summary:¶

Forecasting plays a crucial role in airline management—Airlines project demand to plan the supply of services to respond to that demand. Forecasts enable airlines to make informed decisions about pricing, seat inventory control, and catering. Decisions like advertising and sales campaigns, aircraft scheduling, maintenance planning, and opening new sales offices depend upon predicted demand. Forecasts facilitate the creation of new routes and the acquisition of new aircraft.
However, accurate forecasting of bookings is a challenging task where seasonality plays an important role. Airlines see a surge in demand during the summer holidays or the festive season compared to the rest of the year. Competition, economic circumstances, and external happenings such as weather conditions, natural disasters, or political unrest also affect flight bookings. Airline companies rely on accurate forecasting to offer competitive prices to attract customers and gain a larger market share.
Commonly used techniques to generate a probable data set that relates back to the original data include: -

  • Machine learning: A discipline of artificial intelligence that allows computers to learn from data and past experiences without explicit programming to recognize patterns and make future predictions. ML can be used to project bookings.
  • Time Series Analysis: A way of analyzing historical data and identifying patterns and trends. This technique can identify seasonal trends, such as high demand during peak travel seasons, and forecast future demand based on historical patterns.
  • Regression analysis: A method to examine the relationship between two or more variables. It can project air travel demand depending upon various aspects, such as the destination, price of tickets, and the time of year.

Problem Statement:¶

The objective of this project is to come up with the best forecasting model to predict air travel demand. The selected dataset is public data, which is available from data.world(Original Source: San Francisco Open Data). This dataset is a San Francisco International Airport Report on Monthly Passenger Traffic Statistics by Airline from 2005 to 2016. Airport data is seasonal. Hence, any comparative analyses would be performed on a period-over-period basis (such as March 2015 vs. March 2016) instead of period-to-period (such as. March 2015 vs. April 2015). Notably, fact and attribute field relationships are not always 1-to-1. For example, Passenger Counts belonging to Southwest Airlines will be seen in multiple attribute fields and are additive, allowing us to draw categorical Passenger Counts for the analyses.

Value Proposition:¶

The forecasting model developed by this project can be used to predict air travel demand. Airline companies need to make informed decisions about pricing, capacity, and other areas of their business to optimize their operations and maximize their profitability.
We will use the five steps mentioned below to find the best forecasting model to predict the demand for air travel.

  1. Gather Data
  2. Data Pre-Processing and Series Plot Visualization
  3. Exploratory Data Analysis(EDA)
  4. Create, Fit Model and Generate Forecast
  5. Evaluate Model and Conclusion

Python Libraries:¶

In [1]:
from math import sin
from sklearn.preprocessing import PowerTransformer
%matplotlib inline
from collections import defaultdict
import warnings
import itertools
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy as stat
from scipy.stats import t
from scipy.stats import norm
from scipy.stats import chi2
import numpy.random as nr
from math import sqrt
import math
from pandas import Series
from statsmodels.tsa.stattools import adfuller
from matplotlib import pyplot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from matplotlib import pyplot as plt
import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from patsy import dmatrices
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import qqplot
from pandas.plotting import autocorrelation_plot
from math import sqrt
from math import log
from math import exp
from scipy.stats import boxcox
from pandas import read_csv
from pandas import datetime
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from statsmodels.graphics.gofplots import qqplot
from pandas.plotting import autocorrelation_plot
import statsmodels.api as sm
from pandas import read_csv
from pandas import DataFrame
from scipy.stats import boxcox
from pandas import read_csv
from sklearn.metrics import mean_absolute_percentage_error
from statsmodels.tsa.arima_model import ARIMA
from scipy.stats import boxcox
warnings.filterwarnings("ignore")
import matplotlib.ticker as ticker 
from IPython.display import display, HTML
import plotly.express as px
import plotly.graph_objects as go
display(HTML("<style>.container { width:100% !important; }</style>"))

1. Gather Data¶

1.1 Load Data¶

In [2]:
air_travel = pd.read_csv('dataset//Air_Traffic_Passenger_Statistics.csv')
print('Number of Observations: ', len(air_travel))
display(air_travel.head())
Number of Observations:  15007
Activity Period Operating Airline Operating Airline IATA Code Published Airline Published Airline IATA Code GEO Summary GEO Region Activity Type Code Price Category Code Terminal Boarding Area Passenger Count Adjusted Activity Type Code Adjusted Passenger Count Year Month
0 200507 ATA Airlines TZ ATA Airlines TZ Domestic US Deplaned Low Fare Terminal 1 B 27271 Deplaned 27271 2005 July
1 200507 ATA Airlines TZ ATA Airlines TZ Domestic US Enplaned Low Fare Terminal 1 B 29131 Enplaned 29131 2005 July
2 200507 ATA Airlines TZ ATA Airlines TZ Domestic US Thru / Transit Low Fare Terminal 1 B 5415 Thru / Transit * 2 10830 2005 July
3 200507 Air Canada AC Air Canada AC International Canada Deplaned Other Terminal 1 B 35156 Deplaned 35156 2005 July
4 200507 Air Canada AC Air Canada AC International Canada Enplaned Other Terminal 1 B 34090 Enplaned 34090 2005 July

2. Data Pre-Processing and Series Plot Visualization¶

2.1 Describe the Data Summary¶

Generate descriptive statistics to summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

In [3]:
air_travel.describe() 
Out[3]:
Activity Period Passenger Count Adjusted Passenger Count Year
count 15007.000000 15007.000000 15007.000000 15007.000000
mean 201045.073366 29240.521090 29331.917105 2010.385220
std 313.336196 58319.509284 58284.182219 3.137589
min 200507.000000 1.000000 1.000000 2005.000000
25% 200803.000000 5373.500000 5495.500000 2008.000000
50% 201011.000000 9210.000000 9354.000000 2010.000000
75% 201308.000000 21158.500000 21182.000000 2013.000000
max 201603.000000 659837.000000 659837.000000 2016.000000

We observed that United Airlines, the main carrier at SFO, is listed as 'United Airlines' and 'United Airlines - Pre 07/01/2013'. There is no reference available that explains the second label. For analyzing data prior to July 2013, we relabeled those cases as 'United Airlines'.

In [4]:
air_travel.loc[air_travel["Operating Airline"].str.startswith('United Airlines'), "Operating Airline"] = 'United Airlines'

2.2 Create and Format Date columns for calculation¶

This dataset provides monthly passenger statistics that can be sliced under categories such as geographic regions, price, and activity type codes. We will format travel year and month to project the demand using various models.

In [5]:
month_num = {'January': '01', 'February': '02', 'March': '03', 'April': '04', 'May': '05', 'June': '06', 'July': '07', 
             'August': '08', 'September': '09', 'October': '10', 'November': '11', 'December': '12'}
air_travel['MM'] = air_travel['Month'].map(month_num)
air_travel['YY_MM']= (air_travel['Year'].astype(str) + "-" + air_travel['MM'].astype(str)).astype(str)
air_travel['YY_Month']= (air_travel['Year'].astype(str) + "-" + air_travel['Month']).astype(str)
air_travel['Activity_Period_Start_Date'] = pd.to_datetime(air_travel['Year'].astype(str)  + air_travel['Month'], format='%Y%B')
air_travel.head()
Out[5]:
Activity Period Operating Airline Operating Airline IATA Code Published Airline Published Airline IATA Code GEO Summary GEO Region Activity Type Code Price Category Code Terminal Boarding Area Passenger Count Adjusted Activity Type Code Adjusted Passenger Count Year Month MM YY_MM YY_Month Activity_Period_Start_Date
0 200507 ATA Airlines TZ ATA Airlines TZ Domestic US Deplaned Low Fare Terminal 1 B 27271 Deplaned 27271 2005 July 07 2005-07 2005-July 2005-07-01
1 200507 ATA Airlines TZ ATA Airlines TZ Domestic US Enplaned Low Fare Terminal 1 B 29131 Enplaned 29131 2005 July 07 2005-07 2005-July 2005-07-01
2 200507 ATA Airlines TZ ATA Airlines TZ Domestic US Thru / Transit Low Fare Terminal 1 B 5415 Thru / Transit * 2 10830 2005 July 07 2005-07 2005-July 2005-07-01
3 200507 Air Canada AC Air Canada AC International Canada Deplaned Other Terminal 1 B 35156 Deplaned 35156 2005 July 07 2005-07 2005-July 2005-07-01
4 200507 Air Canada AC Air Canada AC International Canada Enplaned Other Terminal 1 B 34090 Enplaned 34090 2005 July 07 2005-07 2005-July 2005-07-01

Specifically, adjusted passenger numbers for the Activity Type Code 'Enplaned' are filtered for the time series analysis.

In [6]:
air_travel_original = air_travel
air_travel = air_travel.loc[air_travel['Activity Type Code'] == 'Enplaned']

2.3 Time Bound Data Summary¶

We select the data for the activity type code 'Enplaned' and group by 'Year' to understand the data count per column and ensure that we have sufficient data points for each year.

In [7]:
air_travel.groupby('Year').count()
Out[7]:
Activity Period Operating Airline Operating Airline IATA Code Published Airline Published Airline IATA Code GEO Summary GEO Region Activity Type Code Price Category Code Terminal Boarding Area Passenger Count Adjusted Activity Type Code Adjusted Passenger Count Month MM YY_MM YY_Month Activity_Period_Start_Date
Year
2005 317 317 317 317 317 317 317 317 317 317 317 317 317 317 317 317 317 317 317
2006 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624 624
2007 647 647 647 647 647 647 647 647 647 647 647 647 647 647 647 647 647 647 647
2008 653 653 653 653 653 653 653 653 653 653 653 653 653 653 653 653 653 653 653
2009 638 638 638 638 638 638 638 638 638 638 638 638 638 638 638 638 638 638 638
2010 636 636 633 636 633 636 636 636 636 636 636 636 636 636 636 636 636 636 636
2011 646 646 639 646 639 646 646 646 646 646 646 646 646 646 646 646 646 646 646
2012 650 650 642 650 642 650 650 650 650 650 650 650 650 650 650 650 650 650 650
2013 661 661 655 661 655 661 661 661 661 661 661 661 661 661 661 661 661 661 661
2014 659 659 658 659 658 659 659 659 659 659 659 659 659 659 659 659 659 659 659
2015 704 704 704 704 704 704 704 704 704 704 704 704 704 704 704 704 704 704 704
2016 181 181 181 181 181 181 181 181 181 181 181 181 181 181 181 181 181 181 181
In [8]:
time_bound_summary = air_travel[['Adjusted Passenger Count', 'Year','MM']].pivot_table(index=['MM'],columns=['Year'],aggfunc=sum, fill_value=0).T
time_bound_summary
Out[8]:
MM 01 02 03 04 05 06 07 08 09 10 11 12
Year
Adjusted Passenger Count 2005 0 0 0 0 0 0 1569412 1553146 1363140 1384822 1304433 1350000
2006 1182313 1094470 1337770 1371576 1408991 1570272 1583055 1539211 1353586 1417003 1325354 1374406
2007 1217679 1144391 1405982 1420480 1521884 1650947 1656412 1687024 1471094 1568276 1465087 1477301
2008 1297099 1294648 1560380 1504074 1653760 1738334 1761579 1774735 1499313 1569545 1383133 1491674
2009 1290035 1173406 1454796 1500465 1592457 1733926 1798641 1790589 1602188 1633418 1490890 1550460
2010 1359375 1249672 1541956 1559704 1687160 1836349 1846490 1845624 1683566 1752673 1585534 1591589
2011 1409136 1295211 1561541 1593407 1771329 1900610 1932621 1923748 1783953 1804254 1670612 1742295
2012 1571845 1493150 1739289 1764188 1915521 2078066 2103777 2165149 1890812 1934113 1744729 1747550
2013 1558685 1478579 1794640 1774595 1988752 2092041 2060059 2154833 1884343 1957611 1756833 1916814
2014 1675812 1533190 1879974 1916848 2078697 2180660 2195683 2248755 1948420 2034949 1830334 1941106
2015 1736575 1611621 2000314 1992518 2173731 2309593 2351298 2356914 2099362 2216408 2017137 2089548
2016 1825846 1753761 2070896 0 0 0 0 0 0 0 0 0

2.4 Stratify Time Series for Training and Testing Dataset¶

In [9]:
# plot overall dataset plot
dmd_series=air_travel[['YY_MM','Adjusted Passenger Count']].groupby('YY_MM').sum()
dmd_series.plot(figsize=(20, 5), title="Air Travel Demand Plot" , xlabel = "Month-Year", fontsize=12)
plt.show()
In [10]:
warnings.filterwarnings("ignore")
fig, ax = plt.subplots(figsize=(20, 5) )

train_pctg = 79 
X = dmd_series
train_size = int(len(X) * train_pctg / 100)
train, test = X[0:train_size], X[train_size:len(X)]
print('Observations: %d' % (len(X)))
print('Training Observations: %d' % (len(train)))
print('Testing Observations: %d' % (len(test)))
plt.plot(train ,'o-', label='Training Data', linewidth=3)
plt.plot(test , 'o--', label='Testing Data', linewidth=2)

ax.set_title("Air Travel Demand Time Series", fontsize='15')

ax.set_xlabel("Demand Period - Year-Months" , fontsize=15)
ax.set_ylabel("Demand Quantity ")
# Rotating X-axis labels
ax.tick_params(axis='x', labelrotation = 50)
space = 6
ax.xaxis.set_major_locator(ticker.MultipleLocator(space)) 
plt.legend()
plt.show()
Observations: 129
Training Observations: 101
Testing Observations: 28

2.7 Time Series Visualisation by Region¶

We excluded 'South America' region due to insufficient data points.

In [13]:
warnings.filterwarnings("ignore")
c = 0 
region=air_travel[air_travel['GEO Region']!='South America']['GEO Region'].unique()
fig, ax = plt.subplots(nrows=len(region), ncols=1, figsize=(20, 40))

for s in region:
    region_series = air_travel[(air_travel['GEO Region']==s) & (air_travel['Year']<2016)  & (air_travel['GEO Region']!='South America')]
    region_time_data=region_series[['YY_MM','Adjusted Passenger Count']].groupby('YY_MM').sum()
    X = region_time_data
    train_size = int(len(X) * train_pctg / 100)
    train, test = X[0:train_size], X[train_size:len(X)]
    
    ax[c].plot(train,  label='Training Data', linewidth=2)
    ax[c].plot(test , 'r--', label='Testing Data', linewidth=2)
    # Rotating X-axis labels
    ax[c].tick_params(axis='x', labelrotation = 50)
    ax[c].tick_params(axis='x', labelrotation = 50)
    ax[c].xaxis.set_major_locator(ticker.MultipleLocator(space)) 
    ax[c].set_title("{} Region Demand Time Series".format(s), fontsize='12')
    ax[c].set_xlabel("Demand Period - Year-Months" , fontsize=12)
    ax[c].set_ylabel("Demand Quantity ")
    c+=1
    
# using padding
fig.tight_layout(pad=5.0)

plt.legend()
plt.show()

2.10 Adjusted Passenger Count by Operating Airline¶

In [16]:
df_airline = air_travel_original.groupby(["Operating Airline"]).sum().sort_values("Adjusted Passenger Count", ascending=False).head(20) 
df_airline.reset_index(inplace=True) 
plt.figure(figsize = (15,7))
plt.title(" Adjusted Passenger Count by Operating Airline", fontsize=18) 
plt.bar(df_airline["Operating Airline"], df_airline["Adjusted Passenger Count"],color= 'silver', #'#227d3d',
        edgecolor='black', linewidth = 1)
plt.xlabel("Operating Airline",fontsize=15)
plt.ylabel("Adjusted Passenger Count",fontsize=15) 
plt.xticks(fontsize=12, rotation=90)
plt.yticks(fontsize=12)
for k,v in df_airline["Adjusted Passenger Count"].items():
    if v>300000:
        plt.text(k,v-120000, str(v), fontsize=12,rotation=90,color='k', horizontalalignment='center');
    else:
        plt.text(k,v+15000, str(v), fontsize=12,rotation=90,color='k', horizontalalignment='center');

Air Traffic Summary¶

We use different categorical variables to create a summary plot with sankey. We will use the sankey_ly function, a plotly wrapper that creates the data transformation and plots with plotly. We will rank the top 20 airlines during 2015 and plot the passengers distribution by airline, travel type (domestic or international), travel geo region, activity type (enplaned, deplaned and transit) and fare type (low or other):

In [17]:
yr_end = 2015
top_n = 20
air_travel_f =air_travel_original.loc[(air_travel_original['Year'] == yr_end)]

air_travel_g = air_travel_f.groupby(['Operating Airline']).sum('Adjusted Passenger Count').reset_index()
top_n_vals= air_travel_g.nlargest(top_n, 'Adjusted Passenger Count')
air_travel_f = air_travel_f[air_travel_f['Operating Airline'].isin(top_n_vals['Operating Airline'])]


color_link = ['#000000', '#FFFF00', '#1CE6FF', '#FF34FF', '#FF4A46',
             '#008941', '#006FA6', '#A30059','#FFDBE5', '#7A4900', 
             '#0000A6', '#63FFAC', '#B79762', '#004D43', '#8FB0FF',
             '#997D87', '#5A0007', '#809693', '#FEFFE6', '#1B4400', 
             '#4FC601', '#3B5DFF', '#4A3B53', '#FF2F80', '#61615A',
             '#BA0900', '#6B7900', '#00C2A0', '#FFAA92', '#FF90C9',
             '#B903AA', '#D16100', '#DDEFFF', '#000035', '#7B4F4B',                
             '#A1C299', '#300018', '#0AA6D8', '#013349', '#00846F',
             '#372101', '#FFB500', '#C2FFED', '#A079BF', '#CC0744',
             '#C0B9B2', '#C2FF99', '#001E09', '#00489C', '#6F0062', 
             '#0CBD66', '#EEC3FF', '#456D75', '#B77B68', '#7A87A1',
             '#788D66', '#885578', '#FAD09F', '#FF8A9A', '#D157A0',
             '#BEC459', '#456648', '#0086ED', '#886F4C', '#34362D', 
             '#B4A8BD', '#00A6AA', '#452C2C', '#636375', '#A3C8C9', 
             '#FF913F', '#938A81', '#575329', '#00FECF', '#B05B6F',
             '#8CD0FF', '#3B9700', '#04F757', '#C8A1A1', '#1E6E00',
             '#7900D7', '#A77500', '#6367A9', '#A05837', '#6B002C',
             '#772600', '#D790FF', '#9B9700', '#549E79', '#FFF69F', 
             '#201625', '#72418F', '#BC23FF', '#99ADC0', '#3A2465',
             '#922329', '#5B4534', '#FDE8DC', '#404E55', '#0089A3',
             '#CB7E98', '#A4E804', '#324E72', '#6A3A4C']

air_travel_g =air_travel_f.groupby(['Operating Airline', 'GEO Summary']).sum('Adjusted Passenger Count').reset_index()

src = air_travel_g['Operating Airline']
tgt = air_travel_g['GEO Summary']
vals = air_travel_g['Adjusted Passenger Count']

air_travel_g =air_travel_f.groupby(['Operating Airline', 'GEO Summary', 'GEO Region']).sum('Adjusted Passenger Count').reset_index()

src = pd.concat([src, air_travel_g['GEO Summary']])
tgt = pd.concat([tgt, air_travel_g['GEO Region']])
vals = pd.concat([vals, air_travel_g['Adjusted Passenger Count']])

air_travel_g =air_travel_f.groupby(['Operating Airline', 'GEO Summary', 'GEO Region', 'Adjusted Activity Type Code']).sum('Adjusted Passenger Count').reset_index()

src = pd.concat([src, air_travel_g['GEO Region']])
tgt = pd.concat([tgt, air_travel_g['Adjusted Activity Type Code']])
vals = pd.concat([vals, air_travel_g['Adjusted Passenger Count']])

air_travel_g =air_travel_f.groupby(['Operating Airline', 'GEO Summary', 'GEO Region', 'Adjusted Activity Type Code', 'Price Category Code']).sum('Adjusted Passenger Count').reset_index()

src = pd.concat([src, air_travel_g['Adjusted Activity Type Code']])
tgt = pd.concat([tgt, air_travel_g['Price Category Code']])
vals = pd.concat([vals, air_travel_g['Adjusted Passenger Count']])

node_label = pd.concat([src, tgt]) 
unq = set(node_label)
node_label = np.array(list(unq))

node_dict = {y:x for x, y in enumerate(node_label)}
source_node = [node_dict[x] for x in src]
target_node = [node_dict[x] for x in tgt]

link3 = dict(source=source_node, target=target_node, value=vals, color=color_link)
node3 = dict(label = node_label, pad=35, thickness=20)
data3 = go.Sankey(link=link3)

fig3 = ig = go.Figure( 
    data=[go.Sankey( # The plot we are interest
        # This part is for the node information
        node = dict( 
            label = node_label
        ),
        # This part is for the link information
        link = dict(
            source = source_node,
            target = target_node,
            value = vals
        ))])
fig3.update_layout(
    hovermode='x',
    title='Passengers Distribution During 2015',
    font=dict(size=10, color='white'),
    paper_bgcolor='#51504f'
)
fig3.show()

Let’s see the distribution of passengers by geo type (domestic vs. international):

In [18]:
air_travel_sum =air_travel_original.loc[(air_travel_original['Year'] == yr_end)].groupby(['GEO Summary']).sum('Adjusted Passenger Count').reset_index()
fig = px.pie(air_travel_original, values='Adjusted Passenger Count', names='GEO Summary')
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

About 77% of the air passengers traffic during 2015 were domestic.

Let’s use tree map plot to see the distribution of passengers by airline and geo region:

In [19]:
fig = px.treemap(air_travel_original.loc[(air_travel_original['Year'] == yr_end) & (air_travel_original['GEO Summary'] == 'Domestic')], path=[px.Constant("Domestic"), 'Operating Airline'], 
                 values='Adjusted Passenger Count', color='Operating Airline', color_discrete_map={'(?)':'lightgrey'})
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.data[0].textinfo = 'label+text+value+percent parent'
fig.show()

fig = px.treemap(air_travel_original.loc[(air_travel['Year'] == yr_end) & (air_travel_original['GEO Summary'] == 'International')], path=[px.Constant("International"), 'Operating Airline'], 
                 values='Adjusted Passenger Count', color='Operating Airline', color_discrete_map={'(?)':'lightgrey'})
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.data[0].textinfo = 'label+text+value+percent parent'
fig.show()

We can see that United Airlines is the major carrier in both domestic and international flights.

United States Travel Expenditures:¶

In [20]:
import plotly.graph_objects as go
yr_str = 2005
yr_end = 2016
air_travel_sum =air_travel_original.loc[((air_travel_original['Year'] > yr_str) 
                                     & (air_travel_original['Year'] < yr_end) 
                                     & (air_travel_original['GEO Summary'] == 'Domestic'))]
air_travel_sum = air_travel_sum.groupby(['Year']).sum('Adjusted Passenger Count').reset_index()
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=air_travel_sum['Year'], y=air_travel_sum['Adjusted Passenger Count'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5, color='rgb(184, 247, 212)'),
    stackgroup='one',
    name = 'Domestic'
))

air_travel_sum =air_travel_original.loc[((air_travel_original['Year'] > yr_str) 
                                     & (air_travel_original['Year'] < yr_end) 
                                     & (air_travel_original['GEO Summary'] == 'International'))]
air_travel_sum = air_travel_sum.groupby(['Year']).sum('Adjusted Passenger Count').reset_index()
fig.add_trace(go.Scatter(
    x=air_travel_sum['Year'], y=air_travel_sum['Adjusted Passenger Count'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5, color='rgb(131, 90, 241)'),
    name = 'International',
    stackgroup='one' # define stack group
))
fig.update_layout()
fig.show()

We observe that there is a significant growth in the trend from 2007 up until 2012, with passenger numbers levelling off after that. This is applicable to both domestic and international travel. However, there is a larger demand for domestic air travel.

Monthly distribution of passengers:¶

In [21]:
air_travel_sum =air_travel_original.loc[((air_travel_original['Operating Airline'] == 'KLM Royal Dutch Airlines') 
                                     | (air_travel_original['Operating Airline'] == 'American Airlines') 
                                     | (air_travel_original['Operating Airline'] == 'Delta Air Lines') 
                                     | (air_travel_original['Operating Airline'] == 'Southwest Airlines') 
                                     | (air_travel_original['Operating Airline'] == 'United Airlines'))]
air_travel_sum = air_travel_sum.groupby(['Operating Airline', 'Year', 'Month']).sum('Adjusted Passenger Count').reset_index()
month_names = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
fig = px.scatter(air_travel_sum, y="Year", x="Adjusted Passenger Count", color="Operating Airline", facet_col="Month", 
                 labels={"Adjusted Passenger Count": "Passengers"},
                 category_orders={"Month": month_names})
fig.show()

Here is the monthly distribution of pessengers for selected airlines during 2015:

In [22]:
air_travel_sum =air_travel_original.loc[((air_travel_original['Operating Airline'] == 'Southwest Airlines') 
                                     | (air_travel_original['Operating Airline'] == 'Delta Air Lines') 
                                     | (air_travel_original['Operating Airline'] == 'United Airlines'))]
air_travel_sum = air_travel_sum.groupby(['Operating Airline', 'Year', 'MM']).sum('Adjusted Passenger Count').reset_index()
In [23]:
import plotly.graph_objects as go

fig = go.Figure()

yr_end = 2015

fig.add_trace(go.Scatter(
    x=month_names, y=air_travel_sum.loc[(air_travel_sum['Year'] == yr_end) 
                                     & (air_travel_sum['Operating Airline'] == 'Southwest Airlines')]['Adjusted Passenger Count'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5, color='rgb(184, 247, 212)'),
    stackgroup='one',
    name = 'Southwest Airlines'
))
fig.add_trace(go.Scatter(
    x=month_names, y=air_travel_sum.loc[(air_travel_sum['Year'] == yr_end) 
                                     & (air_travel_sum['Operating Airline'] == 'Delta Air Lines')]['Adjusted Passenger Count'],
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5, color='rgb(111, 231, 219)'),
    stackgroup='one',
    name = 'Delta Air Lines'
))
fig.add_trace(go.Scatter(
    x=month_names, y=air_travel_sum.loc[(air_travel_sum['Year'] == yr_end) 
                                     & (air_travel_sum['Operating Airline'] == 'United Airlines')]['Adjusted Passenger Count'], 
    hoverinfo='x+y',
    mode='lines',
    line=dict(width=0.5, color='rgb(131, 90, 241)'),
    name = 'United Airlines',
    stackgroup='one' # define stack group
))
fig.update_layout()
fig.show()

We see that passenger numbers appear to be highest from approximately May — September, after which we see a dip in numbers for the rest of the year.

In [24]:
df_piv = air_travel_original[["Price Category Code","GEO Summary", "Adjusted Passenger Count"]]
pd.pivot_table(df_piv , values = 'Adjusted Passenger Count' , index = 'GEO Summary' , columns = 'Price Category Code' , aggfunc = 'sum')
Out[24]:
Price Category Code Low Fare Other
GEO Summary
Domestic 74527292 264515345
International 680815 100460628
In [25]:
sns.countplot(data = air_travel_original, x = 'Price Category Code' , hue = 'GEO Summary'  , palette = 'viridis')
Out[25]:
<Axes: xlabel='Price Category Code', ylabel='count'>

The above graph shows that the 'Low fare' price code category is popular in Domestic flights. The 'Other' price category code has the most passengers on international flights.

In [26]:
sns.countplot(data = air_travel_original , x = 'Price Category Code' , hue = 'GEO Region'  , palette = 'viridis')
Out[26]:
<Axes: xlabel='Price Category Code', ylabel='count'>

We see the highest air travel demand in the US region under the 'Other' price category code. To evaluate the forecasting models, we will select the 'Other' price code category and US region as sample data.

3. Exploratory Data Analysis. (EDA)¶

3.1 Create Datasets for Forecasting by Price Category Code , GEO Region , GEO Summary , and Operating Airline¶

Here are the main attributes that can be used to slice the data: Price Category Code, GEO Region, GEO Summary, and Operating Airline. We will filter and create separate datasets for each attribute to use in the forecasting model. In our visualizations above, we observed the highest records for some combined attributes, so we will also create some combined datasets to test our model on such high-demand attributes. Sometimes, forecasting on a focused dataset helps businesses improve product projection accuracy across regions, increasing overall sales and profitability.

In [27]:
print(price_codes)
print(region)
print(geo_summary)
['Low Fare' 'Other']
['US' 'Canada' 'Asia' 'Europe' 'Australia / Oceania' 'Mexico'
 'Central America' 'Middle East']
['Domestic' 'International']
In [28]:
# by Price Category Code
low_fare_price_ds= air_travel[air_travel['Price Category Code']=='Low Fare']
other_price_ds= air_travel[air_travel['Price Category Code']=='Other']
In [29]:
# by Price GEO Summary Code
domestic_ds= air_travel[air_travel['GEO Summary']=='Domestic']
international_ds= air_travel[air_travel['GEO Summary']=='International']
In [30]:
# by GEO Region
us_ds= air_travel[air_travel['GEO Region']=='US']
canada_ds= air_travel[air_travel['GEO Region']=='Canada']
asia_ds= air_travel[air_travel['GEO Region']=='Asia']
europe_ds= air_travel[air_travel['GEO Region']=='Europe']
australia_ds= air_travel[air_travel['GEO Region']=='Australia / Oceania']
mexico_ds= air_travel[air_travel['GEO Region']=='Mexico']
central_america_ds= air_travel[air_travel['GEO Region']=='Central America']
middle_east_ds= air_travel[air_travel['GEO Region']=='Middle East']
In [31]:
# Create Combined datasets
us_otherPriceCode_ds= air_travel[ (air_travel['Price Category Code']=='Other') & (air_travel['GEO Region']=='US') ]
domestic_lowFarePriceCode_ds= air_travel[ (air_travel['Price Category Code']=='Low Fare') & (air_travel['GEO Summary']=='Domestic') ]
us_unitedAirline_ds= air_travel[ (air_travel['Operating Airline']=='United Airlines') & (air_travel['GEO Region']=='US') ]
In [32]:
# Create airline datasets
unitedAirline_ds= air_travel[(air_travel['Operating Airline'].str.startswith('United Airlines'))]
display(unitedAirline_ds.head())
americanAirline_ds= air_travel[air_travel['Operating Airline']=='American Airlines']
klm_ds= air_travel[air_travel['Operating Airline']=='KLM Royal Dutch Airlines']
Activity Period Operating Airline Operating Airline IATA Code Published Airline Published Airline IATA Code GEO Summary GEO Region Activity Type Code Price Category Code Terminal Boarding Area Passenger Count Adjusted Activity Type Code Adjusted Passenger Count Year Month MM YY_MM YY_Month Activity_Period_Start_Date
90 200507 United Airlines UA United Airlines UA Domestic US Enplaned Other Terminal 1 B 60939 Enplaned 60939 2005 July 07 2005-07 2005-July 2005-07-01
92 200507 United Airlines UA United Airlines - Pre 07/01/2013 UA Domestic US Enplaned Low Fare Terminal 3 E 58277 Enplaned 58277 2005 July 07 2005-07 2005-July 2005-07-01
94 200507 United Airlines UA United Airlines - Pre 07/01/2013 UA Domestic US Enplaned Other Terminal 3 F 421802 Enplaned 421802 2005 July 07 2005-07 2005-July 2005-07-01
97 200507 United Airlines UA United Airlines - Pre 07/01/2013 UA International Asia Enplaned Other International G 68278 Enplaned 68278 2005 July 07 2005-07 2005-July 2005-07-01
100 200507 United Airlines UA United Airlines - Pre 07/01/2013 UA International Australia / Oceania Enplaned Other International G 8234 Enplaned 8234 2005 July 07 2005-07 2005-July 2005-07-01

3.2 Create Formatted Full + Training + Testing dataset from Main Data source using function¶

We will create function to create 3 formmated dataset - Full_Dataset, and Train_Dataset, Test_Dataset which we will be used in various function ahead to plot and calculate Foreasting model calculation and plot forecasting output.

In [33]:
######## Function to Create Traing and Testing Dataset from Dataframe 
def create_dataset(df, train_size_percentage):
    full_dataset=df[['YY_MM','Adjusted Passenger Count']].groupby('YY_MM').sum()
    train_size = round((train_size_percentage / 100) * (len(full_dataset)))
    test_size = len(full_dataset) - train_size
    train_dataset, test_dataset = full_dataset[0:train_size], full_dataset[train_size:len(full_dataset)]
    return full_dataset, train_dataset, test_dataset
In [34]:
# Create Formatted Dataset for Price Category Code
low_fare_full_df, low_fare_train_df , low_fare_test_df = create_dataset(low_fare_price_ds, train_pctg)
other_full_df, other_train_df , other_test_df = create_dataset(low_fare_price_ds, train_pctg)
In [35]:
# Create Formatted Dataset for GEO Region
us_full_df, us_train_df , us_test_df = create_dataset(us_ds, train_pctg)
canada_full_df, canada_train_df , canada_test_df = create_dataset(canada_ds, train_pctg)
asia_full_df, asia_train_df , asia_test_df = create_dataset(asia_ds, train_pctg)
europe_full_df, europe_train_df , europe_test_df = create_dataset(europe_ds, train_pctg)
australia_full_df, australia_train_df , australia_test_df = create_dataset(australia_ds, train_pctg)
mexico_full_df, mexico_train_df , mexico_test_df = create_dataset(mexico_ds, train_pctg)
central_america_df, central_america_train_df , central_america_test_df = create_dataset(central_america_ds, train_pctg)
middle_east_df, middle_east_train_df , middle_east_test_df = create_dataset(middle_east_ds, train_pctg)
In [36]:
# Create Formatted Dataset for GEO Summary
domestic_full_df, domestic_train_df , domestic_test_df = create_dataset(domestic_ds, train_pctg)
international_full_df, international_train_df , international_test_df = create_dataset(international_ds, train_pctg)
In [37]:
# Create Formatted Dataset for Combined datasets
us_otherPriceCode_full_df, us_otherPriceCode_train_df , us_otherPriceCode_test_df = create_dataset(us_otherPriceCode_ds, train_pctg)
domestic_lowFarePriceCode_full_df, domestic_lowFarePriceCode_train_df , domestic_lowFarePriceCode_test_df = create_dataset(domestic_lowFarePriceCode_ds, 66)
us_unitedAirline_full_df, us_unitedAirline_train_df , us_unitedAirline_test_df = create_dataset(us_unitedAirline_ds, train_pctg)
In [38]:
# Create Formatted Dataset for Operating Airline
unitedAirline_full_df, unitedAirline_train_df , unitedAirline_test_df = create_dataset(unitedAirline_ds, train_pctg)
americanAirline_full_df, americanAirline_train_df , americanAirline_test_df = create_dataset(americanAirline_ds, train_pctg)
klm_full_df, klm_train_df , klm_test_df = create_dataset(klm_ds, train_pctg)
In [39]:
####### Function to Create Yearly Traing and Testing Dataset from Dataframe 
def create_yearly_dataset(df, train_size_percentage):
    full_dataset=df[['Year','Adjusted Passenger Count']].groupby('Year').sum()
    train_size = round((train_size_percentage / 100) * (len(full_dataset)))
    test_size = len(full_dataset) - train_size
    train_dataset, test_dataset = full_dataset[0:train_size], full_dataset[train_size:len(full_dataset)]
    return full_dataset, train_dataset, test_dataset
In [40]:
low_fare_yearly_full_df, low_fare_yearly_train_df , low_fare_yearly_test_df = create_yearly_dataset(low_fare_price_ds, train_pctg)
other_yearly_full_df, other_yearly_train_df , other_yearly_test_df = create_yearly_dataset(other_price_ds, train_pctg)

For the purpose of this project we will take one or more Dataset from Each Section¶

From Price Category Code: Other price cateogory code
From GEO Region: US
From GEO Summary: International
From Combined attributes:
Domestic - Low fare price category code
From Operating Airline:
KLM
United Airline
American Airline

3.3 Summary Statistics¶

Summary Statistics summarizes and provides the gist of information about the sample data. It can include mean, median, mode, minimum value, maximum value, range, standard deviation, etc. Below are the summary statistics of the United Airlines and US region datasets.

In [41]:
klm_full_df.describe()
Out[41]:
Adjusted Passenger Count
count 129.000000
mean 9062.155039
std 2584.274473
min 4421.000000
25% 6542.000000
50% 9513.000000
75% 11633.000000
max 12888.000000
In [42]:
unitedAirline_full_df.describe()
Out[42]:
Adjusted Passenger Count
count 129.000000
mean 654412.736434
std 92823.320734
min 453529.000000
25% 591246.000000
50% 649946.000000
75% 715628.000000
max 874616.000000

3.4 Plot Series Graphs¶

Graphical visualizations help us understand the trends, seasonality, and residual component in the dataset. This information is required to select a suitable Forecasting Model. We will create a function to plot the following graphs:

  1. Line Plot
  2. Histogram Plot
  3. Density Plot
  4. Box Plot
  5. Detrend Residual EDA Plot
    We will take the log transform of data using Boxcox transformation with lambda=0 and then remove the trend component from the series to plot a series graph along with histogram, density, residual & autocorrelation plots.
In [44]:
def plot_dataseries(df, name):
    from matplotlib import pyplot
    fig, ax = plt.subplots(1, 4 ,figsize=(25, 4) )
    df = df.reset_index()
    df_box= df
    df_box['year'] = df_box['YY_MM'].astype(str).str[:4]
    df_box = df[['year','Adjusted Passenger Count']]
    series= df['Adjusted Passenger Count']
    plt.subplot(1, 4, 1)
    series.plot(title="Series Plot for "+name)
    
    plt.subplot(1, 4, 2)
    series.hist()
    plt.title("Histogram for "+name)
    
    plt.subplot(1, 4, 3)
    series.plot(kind='kde',title="Density Plot for "+name)
   
    plt.subplot(1, 4, 4 )
    sns.boxplot(x=df_box['year'], y=df_box['Adjusted Passenger Count'] , palette="Blues" , width=0.3)
    # Rotating X-axis labels
    plt.tick_params(axis='x', labelrotation = 50)
    
    plt.title("Boxplot for "+name)
    plt.show()


def plot_detrend_series(df , dataset_name ):
    fig, axs = pyplot.subplots(2, 3 ,figsize=(25, 6) )
    series = df
    X= [i for i in range(0, len(df))]
    X = np.reshape(X, (len(X), 1))
    y = boxcox(df['Adjusted Passenger Count'],0.0)  ## Used BoxCoX Transformation to do log transformation with lambda = 0.0
    model = LinearRegression()
    model.fit(X, y)
    trend = model.predict(X)
    # plot trend
    pyplot.subplot(2,3,1)
    pyplot.plot(y)
    pyplot.plot(trend)
    pyplot.title("Log Transform Trend Plot for "+dataset_name)
    trend_df= DataFrame(trend)
    # detrend
    detrended = [y[i]-trend[i] for i in range(0, len(series))]
    #####################################################################  
    # plot detrended
    pyplot.subplot(2, 3, 2)
    pyplot.plot(detrended)
    pyplot.title("De-Trend Plot for "+dataset_name )
    residuals = DataFrame(detrended)
  
    pyplot.subplot(2, 3, 3)
    pyplot.hist(residuals)
    pyplot.title("Histogram of DeTrend Series for "+dataset_name )
    
    # density plot
    pyplot.subplot(2, 3, 4)
    sns.kdeplot(detrended )
    pyplot.title("Density plot for DeTrend Series for "+dataset_name )
   
    pyplot.subplot(2, 3, 5)
    qqplot(residuals, line='r', ax=pyplot.gca())
    pyplot.title("QQ plot DeTrend Series for "+dataset_name )

    pyplot.subplot(2, 3, 6)
    autocorrelation_plot(residuals, ax=pyplot.gca())
    pyplot.title("Autocoraltion DeTrend Series for "+dataset_name )
    pyplot.show()
    return residuals
In [45]:
plot_dataseries(klm_full_df,'KLM')
print("-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------")
series_detrend=plot_detrend_series( klm_full_df, "KLM")
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Observations of KLM¶

  • As we can see from the Line plot KLM time series shows quite a pattern where there is a constant mean, variance and autocorrelation. There appears to be significant seasonality present in the dataset — i.e. significant shifts in the time series trend that occur at certain time intervals. From a visual inspection of the time series, it appears that this shift happens approximately every eight months or so.
  • The Density function suggests that the distribution is not Gaussian.
  • Box Plot:
    • The median values for each year may show an increasing trend. The year 2016 was an exception. We do not see outliers in the data.
    • The spread or middle 50% of the data (blue boxes) appears reasonably stable.
  • De-Trend Transform plot:
    • The Log Transform linear plot shows a trend component in the series.
    • The Density plot is skewed to the left.
    • The auto-correlation plot shows a strong seasonal component. We see some significant lag at various points and will investigate this seasonal component using our forecasting model. Seasonal Model might be better fit model for this series.
In [46]:
plot_dataseries(unitedAirline_full_df,'United Airline')
print("-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------")
series_detrend=plot_detrend_series( unitedAirline_full_df, "United Airline")
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Observations of United Airline¶

  • The Line plot suggests that the United Airlines dataset has an additive increase over time with some seasonal spikes.
  • The Density function suggests that the distribution is not Gaussian.
  • Box Plot:
    • The median values for each year may show an increasing trend. The year 2016 was an exception. We dont see outliers in the data.
    • The spread or middle 50% of the data (blue boxes) appears reasonably stable.
  • De-Trend Transform plot:
    • The Log Transform linear plot shows a clear trend component in the series.
    • The Density plot is slightly skewed to the left.
    • The auto-correlation plot shows a strong seasonal component. We see some significant lag at various points and will investigate this seasonal component using our forecasting model.Seasonal Model might be better fit model for this series.

Observations of US GEO region¶

  • The Line plot suggests that the US GEO region dataset has a seasonal component in series with an increasing trend over time.
  • The Density function suggests that the distribution is not Gaussian.
  • Box Plot:
    • The median values for each year may show an increasing trend. The year 2016 was an exception. We dont see outliers in the data.
    • The spread or middle 50% of the data (blue boxes) appears reasonably stable.
  • De-Trend Transform plot:
    • The Log Transform linear Plot of the US GEO region shows a trend component in it. The de-trend series shows a small variation at the end of the series while the initial months have quite a variation.
    • The Density plot also shows a left lag. The Residual plot is homoskedastic.
    • The auto-Correlation plot shows a strong seasonal component. We see some significant lag at various points and will investigate this seasonal component using our forecasting model. Seasonal Model might be better fit model for this series.

4. Create, Fit Model and Generate Forecast¶

We will evaluate below Forecasting Models:

  • 4.1 Print Autocorelation Params
  • 4.2 Linear Regression Forecasting Model(Deg=2) with BoCox Transformation
  • 4.3 ARIMA Forecasting Model
  • 4.4 Autoregression Forecasting Model
  • 4.5 SARIMAX Forecasting Model
  • 4.6 Prophet Foreacst Forecasting Model

4.1 Print Autocorelation Params¶

In [51]:
def print_plot_autocorelation(df,months_in_year, dataset_name):
    from pandas import Series
    #from statsmodels.tsa.stattools import adfuller
    def difference(dataset, interval=1):
        diff = list()
        for i in range(interval, len(dataset)):
            value = dataset[i] - dataset[i - interval]
            diff.append(value)
        return Series(diff)

    X = pd.Series( df['Adjusted Passenger Count']).values #series.values
    #X= series.values
    #print(X)
    X = X.astype('float')
    # difference data
    months_in_year = 12
    stationary = difference(X, months_in_year)
    stationary.index = df.index[months_in_year:]
    # check if stationary
    result = adfuller(stationary)
    print("ADF Statistic for :" + dataset_name)
    print('ADF Statistic: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))
    ## Plot Stationary
    pyplot.figure(figsize=(20,4))
    pyplot.subplot(131)
    stationary.plot(title = "Stationary Series Plot for "+ dataset_name)
    # Rotating X-axis labels
    plt.tick_params(axis='x', labelrotation = 50)
    
    pyplot.subplot(132)
    plot_acf(stationary, lags=11, ax=pyplot.gca(), title = "Autocorelation plot for :" + dataset_name)
    pyplot.subplot(133)
    plot_pacf(stationary, lags=11, ax=pyplot.gca(), title = "Partial Autocorelation plot for :" + dataset_name)
    pyplot.show()
In [52]:
print_plot_autocorelation(klm_train_df,12, "KLM")
ADF Statistic for :KLM
ADF Statistic: -2.525378
p-value: 0.109421
Critical Values:
	1%: -3.518
	5%: -2.900
	10%: -2.587

Autocorelation Observation for KLM¶

  • The results show that the test statistic value -2.5 is greater than the critical value at 1% of -3.5. This suggests that we cannot reject the null hypothesis and can conclude that the process has a unit root, and in turn that the time series is not stationary
  • The ACF shows significant lags and declines slowly.
  • The PACF shows a significant lag for 1 month, with perhaps some significant lag at 2nd month.
In [53]:
print_plot_autocorelation(unitedAirline_train_df,12, "United Airlines")
ADF Statistic for :United Airlines
ADF Statistic: -2.260012
p-value: 0.185189
Critical Values:
	1%: -3.506
	5%: -2.895
	10%: -2.584

Autocorelation Observation for United Airline¶

  • The results show that the test statistic value -2.26 is greater than the critical value at 1% of -3.506. This suggests that we cannot reject the null hypothesis and can conclude that the process has a unit root, and in turn that the time series is not stationary
  • The autocorrelations decline slowly. The initial lags are significant.
  • The PACF shows a significant lag for 1 month, with perhaps some significant lag at 2nd month.

4.2 Linear Regression Forecasting Model(Deg=2) with BoCox Transformation¶

In [59]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_squared_error
warnings.filterwarnings("ignore")
import matplotlib.ticker as ticker 
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

def boxcox_inverse(value, lam):
    if lam == 0:
        return exp(value)
    return exp(log(lam * value + 1) / lam)

def calculate_boxcox_linreg(df, dataset_name, train_percent):
    series =  df 
    X = DataFrame(series.values)
    X.columns = ['Adjusted Passenger Count']
    X['Adjusted Passenger Count'], lam = boxcox(X['Adjusted Passenger Count'])
    print('BoxCox lambda: %f' %lam)
    pyplot.figure(figsize=(20,3) )
    # line plot
    pyplot.subplot(131)
    pyplot.plot(X['Adjusted Passenger Count'])
    pyplot.title("BoxCox Series Plot :"+ dataset_name)
    # histogram
    pyplot.subplot(132)
    pyplot.hist(X['Adjusted Passenger Count'])
    pyplot.title("BoxCox Series Historgram Plot :"+dataset_name)
    #pyplot.show()
    pyplot.subplot(133)
    qqplot(X['Adjusted Passenger Count'], line='r', ax=pyplot.gca())
    pyplot.title("BoxCox Series QQ Plot :"+dataset_name)
    pyplot.show()
###################################################################
    X = pd.Series( df['Adjusted Passenger Count']).values #series.values
    X = X.astype('float32')
    train_size = int(len(X) * train_percent / 100)
    train, test = X[0:train_size], X[train_size:]

    train_x = np.arange(len(train))
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
        transformed ,lam = boxcox(history)
        if lam < -5:
            transformed, lam = history, 1
        fit = np.polyfit(np.arange(len(transformed)), transformed, deg=2)
        fit_function = np.poly1d(fit)
        prediction = fit_function(len(train) + i)
        yhat = boxcox_inverse(prediction, lam)
        predictions.append(yhat)
        # observation
        obs = test[i]
        #history.append(obs)
        print('>Predicted=%.3f, Expected=%.3f' % (yhat, obs))
    # report performance
    rmse = sqrt(mean_squared_error(test, predictions))
    print( "For "+dataset_name + " Dataset : " +'Test RMSE: %.3f' % rmse)
    
    mae = mean_absolute_error(test, predictions)
    print( "For "+dataset_name + " Dataset : " +'Test MAE: %.3f' % mae)
    
    mape = mean_absolute_percentage_error(test, predictions)
    print( "For "+dataset_name + " Dataset : " +'Test MAPE: %.3f' % mape)
    
    train_linreg=list()
    for i in range(len(train)):
        linreg= fit_function(i)
        train_l =boxcox_inverse(linreg, lam)
        train_linreg.append(train_l)    
    pyplot.figure(figsize=(12,5), dpi=100)
    print('BoxCox Lambda of Transformation used: %.3f' % lam)
    
    pyplot.plot(train,  label='Training Data', linewidth=2)
    pyplot.plot( train_linreg , 'r', label='Train_LinReg', linewidth=2)
    pyplot.plot([None for i in train] + [x for x in test] , 'b', label='Testing Data', linewidth=2)
    pyplot.plot([None for i in train] + [x for x in predictions], 'r--', label='Forecast Data', linewidth=2)
    pyplot.title("Linear Regression with BoxCox Transform \n Observed Vs Predicted Plot for " + dataset_name )
    pyplot.legend(loc='upper left', fontsize=8)
    pyplot.show()
    model_summary.loc[('Linear Regression(BoxCox)', 'RMSE'), 
                  dataset_name], model_summary.loc[('Linear Regression(BoxCox)', 'MAE'), 
                dataset_name], model_summary.loc[('Linear Regression(BoxCox)', 'MAPE'), 
                dataset_name]= rmse, mae, mape
In [60]:
calculate_boxcox_linreg(klm_full_df, "KLM", train_pctg)
BoxCox lambda: 1.163519
>Predicted=9050.380, Expected=6256.000
>Predicted=9026.256, Expected=5244.000
>Predicted=9001.523, Expected=4695.000
>Predicted=8976.180, Expected=6797.000
>Predicted=8950.226, Expected=10771.000
>Predicted=8923.659, Expected=12097.000
>Predicted=8896.477, Expected=11966.000
>Predicted=8868.680, Expected=11023.000
>Predicted=8840.265, Expected=12130.000
>Predicted=8811.232, Expected=11774.000
>Predicted=8781.577, Expected=10433.000
>Predicted=8751.300, Expected=5822.000
>Predicted=8720.399, Expected=6447.000
>Predicted=8688.872, Expected=6222.000
>Predicted=8656.718, Expected=5012.000
>Predicted=8623.933, Expected=6327.000
>Predicted=8590.517, Expected=10831.000
>Predicted=8556.467, Expected=11745.000
>Predicted=8521.781, Expected=11633.000
>Predicted=8486.458, Expected=10562.000
>Predicted=8450.494, Expected=11510.000
>Predicted=8413.888, Expected=11669.000
>Predicted=8376.637, Expected=10221.000
>Predicted=8338.739, Expected=7366.000
>Predicted=8300.191, Expected=7321.000
>Predicted=8260.992, Expected=5930.000
>Predicted=8221.137, Expected=5338.000
>Predicted=8180.626, Expected=7726.000
For KLM Dataset : Test RMSE: 2685.737
For KLM Dataset : Test MAE: 2542.532
For KLM Dataset : Test MAPE: 0.330
BoxCox Lambda of Transformation used: 1.261
In [61]:
calculate_boxcox_linreg(unitedAirline_full_df, "United Airline", train_pctg)
BoxCox lambda: 0.144278
>Predicted=753981.198, Expected=727154.000
>Predicted=760119.839, Expected=644051.000
>Predicted=766393.839, Expected=591906.000
>Predicted=772804.963, Expected=718590.000
>Predicted=779355.015, Expected=747788.000
>Predicted=786045.834, Expected=804017.000
>Predicted=792879.296, Expected=839154.000
>Predicted=799857.315, Expected=839442.000
>Predicted=806981.844, Expected=860524.000
>Predicted=814254.874, Expected=744977.000
>Predicted=821678.433, Expected=775863.000
>Predicted=829254.590, Expected=683178.000
>Predicted=836985.453, Expected=725061.000
>Predicted=844873.171, Expected=636301.000
>Predicted=852919.933, Expected=591246.000
>Predicted=861127.968, Expected=731963.000
>Predicted=869499.546, Expected=719732.000
>Predicted=878036.982, Expected=801291.000
>Predicted=886742.629, Expected=848604.000
>Predicted=895618.887, Expected=874616.000
>Predicted=904668.195, Expected=865614.000
>Predicted=913893.038, Expected=771810.000
>Predicted=923295.945, Expected=795921.000
>Predicted=932879.490, Expected=711754.000
>Predicted=942646.290, Expected=738399.000
>Predicted=952599.010, Expected=633387.000
>Predicted=962740.360, Expected=615685.000
>Predicted=973073.097, Expected=752729.000
For United Airline Dataset : Test RMSE: 152895.434
For United Airline Dataset : Test MAE: 122828.409
For United Airline Dataset : Test MAPE: 0.178
BoxCox Lambda of Transformation used: 0.334

4.3 ARIMA Forecasting Model¶

4.3.1 Grid Search for Arima Paramater¶

Define a function to perform a search of ARIMA hyperparameters. Search p, d, and q combinations (skipping those that fail to converge). Return the combination that results in the best performance for the test dataset. Use grid search to explore all combinations in a set of integer values.

In [66]:
import warnings
from math import sqrt
from pandas import read_csv
from pandas import datetime
import statsmodels.tsa.arima_process as arima
from statsmodels.tsa.arima.model import ARIMA, ARIMAResults
from sklearn.metrics import mean_squared_error

warnings.filterwarnings("ignore")

# evaluate an ARIMA model for a given order (p,d,q)
def evaluate_arima_model(X, arima_order, train_percent):
    # prepare training dataset
    train_size = int(len(X) * train_percent / 100)
    train, test = X[0:train_size], X[train_size:]
    history = [x for x in train]
    #print(history)
    # make forecast prediction value
    predictions = list()
    for t in range(len(test)):
        model = ARIMA(history, order=arima_order)
        model_fit = model.fit()
        yhat = model_fit.forecast()[0]
        predictions.append(yhat)
        history.append(test[t])
    # calculate out of forecast error
    rmse = sqrt(mean_squared_error(test, predictions))
    return rmse

# evaluate combinations of p, d and q values for an ARIMA model
def evaluate_models(dataset, p_values, d_values, q_values, dataset_name, train_percent):
    dataset = dataset.astype('float32')
    best_score, best_cfg = float("inf"), None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                order = (p,d,q)
                try:
                    rmse = evaluate_arima_model(dataset, order, train_percent)
                    if rmse < best_score:
                        best_score, best_cfg = rmse, order
                    #print('ARIMA%s RMSE=%.3f' % (order,rmse))    
                except:
                    continue
    print( "For "+ dataset_name + ' Best ARIMA%s RMSE=%.3f' % (best_cfg, best_score))
    return best_score, best_cfg
In [67]:
p_values = [0, 1, 2, 4]
d_values = range(0, 3)
q_values = range(0, 3)
warnings.filterwarnings("ignore")
klm_rmse, klm_arima_order = evaluate_models(klm_full_df.values, p_values, d_values, q_values,"KLM", train_pctg)
For KLM Best ARIMA(4, 0, 2) RMSE=1089.247

Printing of Intermidiate result for ARIMA Parmater was commented and only Best ARIMA parameter is printed to save the space

In [68]:
p_values = [0, 1, 2, 4]
d_values = range(0, 3)
q_values = range(0, 3)
warnings.filterwarnings("ignore")
unitedAirline_rmse, unitedAirline_arima_order = evaluate_models(unitedAirline_full_df.values, p_values, d_values, q_values,"United Airline", train_pctg)
For United Airline Best ARIMA(4, 0, 2) RMSE=61138.822

4.3.2 Calculate and Plot ARIMA Forecast¶

Define a function to fit ARIMA Model with the test dataset and calculate predicted values for the Testing period. Plot expected Vs Predicted values.

In [73]:
from pandas import read_csv
from pandas import datetime
from matplotlib import pyplot
import statsmodels.tsa.arima_process as arima
from statsmodels.tsa.arima.model import ARIMA, ARIMAResults
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from math import sqrt

def plot_ARIMA(df,test_percent,arima_order,dataset_name ):
    X = df
    train_size = int(len(X) * test_percent / 100)
    train, test = X[0:train_size], X[train_size:len(X)]
    history = [x for x in train.values]
    predictions = list()
    # walk-forward validation
    for t in range(0,len(test)):
        model = ARIMA(history, order= arima_order ) #(0,1,1))
        model_fit = model.fit()
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        obs = test.values[t]
        history.append(obs)
        print('predicted=%f, expected=%f' % (yhat, obs))
    # evaluate forecasts
    rmse = sqrt(mean_squared_error(test, predictions))
    print( "For "+dataset_name + " Dataset : " +'Test RMSE: %.3f' % rmse)
    
    mae = mean_absolute_error(test, predictions)
    print( "For "+dataset_name + " Dataset : " +'Test MAE: %.3f' % mae)
    mape = mean_absolute_percentage_error(test, predictions)
    print( "For "+dataset_name + " Dataset : " +'Test MAPE: %.3f' % mape)
    # plot forecasts against actual outcomes
    pyplot.figure(figsize=(12,5), dpi=100)
    
    fig, ax = plt.subplots(figsize=(20, 5) )
    # Rotating X-axis labels
    ax.tick_params(axis='x', labelrotation = 50)
    space = 6
    ax.xaxis.set_major_locator(ticker.MultipleLocator(space))
    pyplot.plot(train,  label='Training Data', linewidth=2)
    pyplot.plot(test, 'b', label='Testing Data', linewidth=2)
    pyplot.plot(pd.DataFrame(predictions, test.index), 'r--', label='Forecast Data', linewidth=2)
    
    pyplot.title("ARIMA Forecast \n Observed Vs Predicted Plot for " + dataset_name )
    pyplot.legend(loc='upper left', fontsize=8)
    pyplot.show()
    model_summary.loc[('ARIMA', 'RMSE'), 
                  dataset_name], model_summary.loc[('ARIMA', 'MAE'), 
                dataset_name], model_summary.loc[('ARIMA', 'MAPE'), 
                dataset_name]  = rmse, mae, mape
In [74]:
plot_ARIMA(klm_full_df, train_pctg, (klm_arima_order),"KLM"  )
predicted=5297.472540, expected=6256.000000
predicted=5673.101081, expected=5244.000000
predicted=5672.259565, expected=4695.000000
predicted=6505.316181, expected=6797.000000
predicted=8853.055613, expected=10771.000000
predicted=11787.058107, expected=12097.000000
predicted=12761.302912, expected=11966.000000
predicted=11737.889137, expected=11023.000000
predicted=11286.507209, expected=12130.000000
predicted=10883.749284, expected=11774.000000
predicted=9376.018807, expected=10433.000000
predicted=7742.058672, expected=5822.000000
predicted=5126.216997, expected=6447.000000
predicted=5957.682558, expected=6222.000000
predicted=6416.567716, expected=5012.000000
predicted=6914.291419, expected=6327.000000
predicted=8507.037679, expected=10831.000000
predicted=11986.056251, expected=11745.000000
predicted=12395.646447, expected=11633.000000
predicted=12242.021466, expected=10562.000000
predicted=11006.839812, expected=11510.000000
predicted=10498.249075, expected=11669.000000
predicted=9191.318117, expected=10221.000000
predicted=7433.781773, expected=7366.000000
predicted=5990.243765, expected=7321.000000
predicted=6686.649529, expected=5930.000000
predicted=6633.196530, expected=5338.000000
predicted=7342.435504, expected=7726.000000
For KLM Dataset : Test RMSE: 1089.247
For KLM Dataset : Test MAE: 936.701
For KLM Dataset : Test MAPE: 0.119
<Figure size 1200x500 with 0 Axes>
In [75]:
plot_ARIMA(unitedAirline_full_df, train_pctg, (unitedAirline_arima_order),"United Airline"  )
predicted=627664.046981, expected=727154.000000
predicted=716906.066515, expected=644051.000000
predicted=625576.128550, expected=591906.000000
predicted=603241.063232, expected=718590.000000
predicted=636523.086197, expected=747788.000000
predicted=785509.829848, expected=804017.000000
predicted=748545.684213, expected=839154.000000
predicted=815298.635011, expected=839442.000000
predicted=823814.826971, expected=860524.000000
predicted=781566.852220, expected=744977.000000
predicted=742376.872709, expected=775863.000000
predicted=708143.470031, expected=683178.000000
predicted=681656.231765, expected=725061.000000
predicted=710721.018150, expected=636301.000000
predicted=595317.569935, expected=591246.000000
predicted=624896.586169, expected=731963.000000
predicted=697115.358376, expected=719732.000000
predicted=763896.281637, expected=801291.000000
predicted=748036.648619, expected=848604.000000
predicted=841584.516292, expected=874616.000000
predicted=862408.632754, expected=865614.000000
predicted=777647.483805, expected=771810.000000
predicted=781593.993193, expected=795921.000000
predicted=717107.818166, expected=711754.000000
predicted=716882.466594, expected=738399.000000
predicted=712289.239268, expected=633387.000000
predicted=590729.603371, expected=615685.000000
predicted=655440.676462, expected=752729.000000
For United Airline Dataset : Test RMSE: 61138.822
For United Airline Dataset : Test MAE: 48985.629
For United Airline Dataset : Test MAPE: 0.067
<Figure size 1200x500 with 0 Axes>

4.4 Autoregression Forecasting Model¶

Autoregression is a linear regression model that uses lagged variables as input.The statsmodels library provides an AutoReg class that can be instantiated passing a lag value parameter. After creating the model we call fit() method to train it on our dataset which returns an AutoRegResults object. Once fit, we can use the model to make a predict future values by calling the predict() function.

In [80]:
def plot_autoreg(df , test_percent, dataset_name) :
    from statsmodels.tsa.ar_model import AutoReg
    from sklearn.metrics import mean_squared_error
    from sklearn.metrics import mean_absolute_error
    from sklearn.metrics import mean_absolute_percentage_error
    from math import sqrt
    
    X = df
    train_size = int(len(X) * test_percent / 100)
    train, test = X[0:train_size], X[train_size:len(X)]
    
    # train autoregression Model 
    window = 12
    model = AutoReg(train.values, lags=12)
    model_fit = model.fit()
    print("Autoregression Forecast for "+dataset_name)
    #print(model_fit.summary())
    coef = model_fit.params
    # walk forward over time steps in test
    history = train.values[len(train)-window:]
    #print(history)
    history = [history[i] for i in range(len(history))]
    predictions = list()
    for t in range(len(test)):
        length = len(history)
        lag = [history[i] for i in range(length-window,length)]
        yhat = coef[0]
        for d in range(window):
            yhat += coef[d+1] * lag[window-d-1]
        obs = test.values[t]
        predictions.append(yhat)
        history.append(obs)
        print('predicted=%f, expected=%f' % (yhat, obs))
    # Print Forecasting Errors to evaluation
    rmse = sqrt(mean_squared_error(test, predictions))
    print("For "+dataset_name + " Dataset : " +'Test RMSE: %.3f' % rmse)
    
    mae = mean_absolute_error(test, predictions)
    print("For "+dataset_name + " Dataset : " +'Test MAE: %.3f' % mae)
    
    mape = mean_absolute_percentage_error(test, predictions)
    print("For "+dataset_name + " Dataset : " +'Test MAPE: %.3f' % mape)
    
    # plot forecasts against actual outcomes
    pyplot.figure(figsize=(12,5), dpi=100)
    
    fig, ax = plt.subplots(figsize=(20, 5) )
    # Rotating X-axis labels
    ax.tick_params(axis='x', labelrotation = 50)
    space = 6
    ax.xaxis.set_major_locator(ticker.MultipleLocator(space))
    pyplot.plot(train,  label='Training Data', linewidth=2)
    pyplot.plot(test, 'b', label='Testing Data', linewidth=2)
    pyplot.plot(pd.DataFrame(predictions, test.index), 'r--', label='Forecast Data', linewidth=2)
    
    pyplot.title("Autoregression Forecast \n Observed Vs Predicted Plot for " + dataset_name )
    pyplot.legend(loc='upper left', fontsize=8)
    pyplot.show()
    model_summary.loc[('Auto Regression', 'RMSE'), 
                  dataset_name], model_summary.loc[('Auto Regression', 'MAE'), 
                dataset_name], model_summary.loc[('Auto Regression', 'MAPE'), 
                dataset_name]   = rmse, mae, mape
In [81]:
plot_autoreg(klm_full_df , train_pctg, "KLM")
Autoregression Forecast for KLM
predicted=5390.621050, expected=6256.000000
predicted=5464.292496, expected=5244.000000
predicted=4983.271111, expected=4695.000000
predicted=5871.312458, expected=6797.000000
predicted=9679.825112, expected=10771.000000
predicted=12575.738599, expected=12097.000000
predicted=12730.879380, expected=11966.000000
predicted=12043.588589, expected=11023.000000
predicted=11665.712694, expected=12130.000000
predicted=11628.530433, expected=11774.000000
predicted=10159.478245, expected=10433.000000
predicted=7285.231787, expected=5822.000000
predicted=5533.644140, expected=6447.000000
predicted=5721.738229, expected=6222.000000
predicted=5559.787519, expected=5012.000000
predicted=6289.698440, expected=6327.000000
predicted=9434.477162, expected=10831.000000
predicted=12590.295134, expected=11745.000000
predicted=12454.230054, expected=11633.000000
predicted=11630.430279, expected=10562.000000
predicted=11431.790294, expected=11510.000000
predicted=11492.272158, expected=11669.000000
predicted=10254.010601, expected=10221.000000
predicted=7170.188997, expected=7366.000000
predicted=6442.477499, expected=7321.000000
predicted=6663.081950, expected=5930.000000
predicted=5436.517374, expected=5338.000000
predicted=6170.904480, expected=7726.000000
For KLM Dataset : Test RMSE: 777.294
For KLM Dataset : Test MAE: 638.596
For KLM Dataset : Test MAPE: 0.080
<Figure size 1200x500 with 0 Axes>
In [82]:
plot_autoreg(unitedAirline_full_df , train_pctg, "United Airline")
Autoregression Forecast for United Airline
predicted=663624.602202, expected=727154.000000
predicted=675266.822014, expected=644051.000000
predicted=622358.328467, expected=591906.000000
predicted=688398.619584, expected=718590.000000
predicted=679307.626092, expected=747788.000000
predicted=791136.439380, expected=804017.000000
predicted=819565.053535, expected=839154.000000
predicted=868372.218292, expected=839442.000000
predicted=879847.217961, expected=860524.000000
predicted=781980.091317, expected=744977.000000
predicted=746390.050650, expected=775863.000000
predicted=696227.394836, expected=683178.000000
predicted=715968.177886, expected=725061.000000
predicted=712795.311407, expected=636301.000000
predicted=631458.552499, expected=591246.000000
predicted=705462.174218, expected=731963.000000
predicted=723698.260468, expected=719732.000000
predicted=809762.575510, expected=801291.000000
predicted=834198.322803, expected=848604.000000
predicted=884874.625571, expected=874616.000000
predicted=910027.876354, expected=865614.000000
predicted=794147.459832, expected=771810.000000
predicted=786916.865479, expected=795921.000000
predicted=703907.383278, expected=711754.000000
predicted=730099.878923, expected=738399.000000
predicted=706329.262353, expected=633387.000000
predicted=637739.754874, expected=615685.000000
predicted=721062.574319, expected=752729.000000
For United Airline Dataset : Test RMSE: 34771.434
For United Airline Dataset : Test MAE: 28288.749
For United Airline Dataset : Test MAPE: 0.040
<Figure size 1200x500 with 0 Axes>

ARIMA- SARIMAX Forecasting Model¶

4.5.1 Visualizing Seasonal Decomposition of Time series data¶

In [87]:
def plot_seasonal_decomposition(df, dataset_name):
    from pylab import rcParams
    import statsmodels.api as sm

    df= df.groupby('Activity_Period_Start_Date')['Adjusted Passenger Count'].sum().reset_index()
    df=df.set_index('Activity_Period_Start_Date')
    y = df['Adjusted Passenger Count'].resample('MS').mean()
    
    print("Decomposition Plots for ", dataset_name )
    rcParams['figure.figsize'] = 12, 8
    decomposition = sm.tsa.seasonal_decompose(y, model='additive')
    fig = decomposition.plot()
    plt.show()
In [88]:
plot_seasonal_decomposition(klm_ds,'KLM')
Decomposition Plots for  KLM
In [89]:
plot_seasonal_decomposition(unitedAirline_ds,'United Airline')
Decomposition Plots for  United Airline

4.5.2 Calculate SARIMAX Parameter¶

In [94]:
def compute_SARIMAX_param(df, dataset_name ):
    
    df= df.groupby('Activity_Period_Start_Date')['Adjusted Passenger Count'].sum().reset_index()
    df=df.set_index('Activity_Period_Start_Date')
    y = df['Adjusted Passenger Count'].resample('MS').mean()
    
    p = d = q = range(0, 2)
    pdq = list(itertools.product(p, d, q))
    seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]


    best_aic, best_cfg , best_scfg= float("inf"), None, None
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            try:
                mod = sm.tsa.statespace.SARIMAX(y,
                                                order=param,
                                                seasonal_order=param_seasonal,
                                                enforce_stationarity=False,
                                                enforce_invertibility=False)
                results = mod.fit()

                #print('ARIMA{}x{}12 - AIC:{} - BIC:{}'.format(param, param_seasonal, results.aic, results.bic))

                if results.aic < best_aic:
                    best_aic, best_cfg, best_scfg = results.aic , param , param_seasonal
            except:
                continue
    print("For "+ dataset_name +': Best AIC:{} SARIMAX: {} x {}'.format(best_aic, best_cfg, best_scfg))
    return results.aic , param , param_seasonal
In [95]:
klm_aic , klm_param , klm_param_seasonal = compute_SARIMAX_param(klm_ds , 'KLM' )
For KLM: Best AIC:1648.5400930612086 SARIMAX: (1, 1, 1) x (0, 1, 1, 12)
In [96]:
unitedAirline_aic , unitedAirline_param , unitedAirline_param_seasonal = compute_SARIMAX_param(unitedAirline_ds , 'United Airline' )
For United Airline: Best AIC:2326.082245597009 SARIMAX: (0, 1, 1) x (0, 1, 1, 12)

4.5.3 Plot SARIMAX Model Summary and SARIMAX Forecast¶

In [101]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_squared_error

def plot_SARIMAX_FCST_model_summary(df, arima_order, sea_order, dataset_name):
    
    df= df.groupby('Activity_Period_Start_Date')['Adjusted Passenger Count'].sum().reset_index()
    df=df.set_index('Activity_Period_Start_Date')
    y = df['Adjusted Passenger Count'].resample('MS').mean()
    
    mod = sm.tsa.statespace.SARIMAX(y,
                                order= arima_order, #(0, 1, 1),
                                seasonal_order = sea_order, #   (0, 1, 1, 12),
                                #enforce_stationarity=False,
                                enforce_invertibility=False)

    results = mod.fit()
    print("SARIMAX Model Summary for "+ dataset_name)
    print(results.summary().tables[1])
    results.plot_diagnostics(figsize=(16, 8))
    plt.show()
    
    pred = results.get_prediction(start=pd.to_datetime('2015-01-01'), dynamic=False)
    pred_ci = pred.conf_int()
    
    ax = y['2005':].plot(label='observed')
    pred.predicted_mean.plot(ax=ax, label='SARIMAX Forecast', alpha=.7, figsize=(14, 7))

    ax.fill_between(pred_ci.index,
                    pred_ci.iloc[:, 0],
                    pred_ci.iloc[:, 1], color='k', alpha=.2)
    ax.set_xlabel('Date')
    ax.set_ylabel( dataset_name +'  Adjusted Passenger Count')
    ax.set_title ( " SARIMAX Validate Forecast With Test Data For "+dataset_name + " Adjusted Passenger Count")
    plt.legend(loc='upper left', fontsize=8)
    plt.show()
    
    y_forecasted = pred.predicted_mean
    y_truth = y['2015-01-01':]
    
    #print(y_forecasted)
    #print(y_truth)
    # Calculate  the Root mean square error
     
    rmse = sqrt(mean_squared_error(y_truth, y_forecasted))
    print("For "+dataset_name + " Dataset : " +'Test RMSE: %.3f' % rmse)
    
    # Calculate  the  mean absolute error
    mae = mean_absolute_error(y_truth, y_forecasted)
    print( "For "+dataset_name + " Dataset : " +'Test MAE: %.3f' % mae)
    
    # Calculate  the  mean absolute percentage error
    mape = mean_absolute_percentage_error(y_truth, y_forecasted)
    print( "For "+dataset_name + " Dataset : " +'Test MAPE: %.3f' % mape)
    
    pred_uc = results.get_forecast(steps=60)
    pred_ci = pred_uc.conf_int()
    ax = y.plot(label='observed', figsize=(14, 7))
    pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
    ax.fill_between(pred_ci.index,
                    pred_ci.iloc[:, 0],
                    pred_ci.iloc[:, 1], color='k', alpha=.25)
    ax.set_xlabel('Date')
    ax.set_title("  SARIMAX Forecast for Next 5 Years : "+dataset_name)
    ax.set_ylabel(dataset_name +'Adjusted Passenger Count')

    plt.legend()
    plt.show()
    model_summary.loc[('SARIMAX', 'RMSE'), 
                  dataset_name], model_summary.loc[('SARIMAX', 'MAE'), 
                dataset_name], model_summary.loc[('SARIMAX', 'MAPE'), 
                dataset_name]   = rmse, mae, mape
In [102]:
plot_SARIMAX_FCST_model_summary(klm_ds, unitedAirline_param, klm_param_seasonal, 'KLM')
SARIMAX Model Summary for KLM
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4099      0.486     -0.844      0.399      -1.362       0.542
ma.L1          0.2310      0.524      0.441      0.660      -0.797       1.259
ar.S.L12      -0.1689      0.187     -0.905      0.366      -0.535       0.197
ma.S.L12      -0.2909      0.227     -1.282      0.200      -0.736       0.154
sigma2      6.011e+05   6.16e+04      9.762      0.000     4.8e+05    7.22e+05
==============================================================================
For KLM Dataset : Test RMSE: 681.323
For KLM Dataset : Test MAE: 527.306
For KLM Dataset : Test MAPE: 0.073
In [103]:
plot_SARIMAX_FCST_model_summary(unitedAirline_ds, unitedAirline_param, unitedAirline_param_seasonal, 'United Airline')
SARIMAX Model Summary for United Airline
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.7284      1.720      0.423      0.672      -2.644       4.100
ma.L1         -0.7372      1.706     -0.432      0.666      -4.081       2.607
ar.S.L12       0.6499      0.146      4.463      0.000       0.365       0.935
ma.S.L12      -0.7920      0.173     -4.576      0.000      -1.131      -0.453
sigma2      5.009e+08    4.7e-10   1.07e+18      0.000    5.01e+08    5.01e+08
==============================================================================
For United Airline Dataset : Test RMSE: 18927.785
For United Airline Dataset : Test MAE: 15910.831
For United Airline Dataset : Test MAPE: 0.022

4.6 Prophet Forecasting Model¶

In [108]:
import datetime
def create_Prophet_df(df):
    df= df[['Activity_Period_Start_Date','Adjusted Passenger Count']]
    df = df.groupby('Activity_Period_Start_Date')['Adjusted Passenger Count'].sum().reset_index() #.to_frame('Adjusted Passenger Count')
    #df= df.reset_index()
    df = df.set_index('Activity_Period_Start_Date')
    df = df['Adjusted Passenger Count'].resample('MS').mean()
    df = pd.DataFrame({'Activity_Period_Start_Date':df.index, 'Adjusted Passenger Count':df.values})
    return df
In [109]:
unitedAirline_prophet_dataset  = create_Prophet_df(unitedAirline_ds)
us_prophet_dataset  = create_Prophet_df(us_ds)
other_price_prophet_dataset  = create_Prophet_df(other_price_ds)
international_prophet_dataset  = create_Prophet_df(international_ds)
domestic_lowFarePriceCode_prophet_dataset  = create_Prophet_df(domestic_lowFarePriceCode_ds)
americanAirline_prophet_dataset  = create_Prophet_df(americanAirline_ds)
klm_prophet_dataset  = create_Prophet_df(klm_ds)
In [110]:
#!pip install pystan
#!pip install convertdate
#!pip install lunarcalendar
#!pip install lunardate
#!pip install holidays
#!pip install ephem
#!pip install prophet
In [111]:
from pandas import read_csv
from pandas import to_datetime
from pandas import DataFrame
from prophet import Prophet
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from matplotlib import pyplot
  
def plot_prophet_forecast(df, train_percent, dataset_name):
    df.columns = ['ds', 'y']
    train_size = int(len(X) * train_percent / 100)
    test_size = len(df) - train_size
    train, future = df[0:train_size], df
    future = future.drop(['y'], axis=1)
    # define the model
    model = Prophet()
    # fit the model
    model.fit(train)
     # use the model to make a forecast    
    #print(future.tail(test_size))
    
    forecast=model.predict(future)
    #print(forecast)
    fig1 = model.plot(forecast)
    ax = fig1.gca()
    ax.set_title("Plot of the forecast for " + dataset_name, size=12)
    
    fig1 = model.plot_components(forecast)     
    ax = fig1.gca()
    ax.set_title("The components of the forecast for: " + dataset_name, size=8)
    
    # calculate MAE between expected and predicted values 
    y_true = df['y'][-test_size:].values
    #print(y_true)
    y_pred = forecast['yhat'][-test_size:].values
    #print(y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    print('MAE: %.3f' % mae)
    mape = mean_absolute_percentage_error(y_true, y_pred)
    print('MAPE: %.3f' % mape)
    
    pyplot.figure(figsize=(10,5), dpi=100)

    pyplot.plot(train['y'],  label='Training Data', linewidth=2)
    pyplot.plot([None for i in train['y']] + [x for x in y_true] , 'b', label='Testing Data', linewidth=2)
    pyplot.plot([None for i in train['y']] + [x for x in y_pred], 'r--', label='Forecast Data', linewidth=2)

    pyplot.title(" Prophet Forecast \n Observed Vs Predicted Plot for "  + dataset_name )
    pyplot.legend(loc='upper left', fontsize=8)

    pyplot.legend()
    pyplot.show()
    model_summary.loc[('Prophet', 'MAE'), 
                      dataset_name], model_summary.loc[('Prophet', 'MAPE'), 
                      dataset_name] = mae, mape
In [112]:
train_pctg_prophet = 79 
plot_prophet_forecast(klm_prophet_dataset, train_pctg_prophet, "KLM")
12:45:33 - cmdstanpy - INFO - Chain [1] start processing
12:45:33 - cmdstanpy - INFO - Chain [1] done processing
MAE: 511.827
MAPE: 0.065

Observations for KLM¶

  • We seee that there is a significant growth in the trend from 2007 up until 2009, with passenger numbers levelling off after that.
  • We also observe that passenger numbers appear to be highest from approximately May — September, after which we see a dip in numbers for the rest of the year.
  • Model Validation: Now that the forecasting model has been developed, we can compare the predicted passenger numbers to the test data in order to evaluate model accuracy. Mean Forecast Error is 511. With a mean of approximately 9000 passengers per month — the errors are quite low in comparison to this figure — indicating that the model is performing well in forecasting monthly passenger numbers.
In [113]:
plot_prophet_forecast(unitedAirline_prophet_dataset, train_pctg_prophet, "United Airline")
12:45:35 - cmdstanpy - INFO - Chain [1] start processing
12:45:35 - cmdstanpy - INFO - Chain [1] done processing
MAE: 49676.393
MAPE: 0.071

5. Evaluate Model & Conclusion Summary¶

This project focuses on developing models to forecast air travel demand. Time Series Analysis and Forecasting is performed using several statistical Models and visualizations. The crucial step here is to verify the model's correctness by doing an error measurement. It is essential to note the forecasting deviation from the actual data. We will evaluate Forecasting Models using MAPE - Mean Absolute PercentageError. The table below shows the MAPE, MAE (and RMSE in some cases) for each forecasting model used to generate the projected data.

If the value of data is large, e.g., Adjusted Passengers Count ranging from 1000 to 1000,000, the use of MAPE method is a better choice. The reason is that the value of MAE may be too big and lead to confusion. For example, for a data value of 10,000, the value for MAE is 500 and the corresponding value for MAPE is 5%, which is within good engineering tolerance. However, if the absolute value of 500 is used, it is quite large in an absolute sense and just leads to confusion.

In [118]:
# Define a function to highlight minimum value
def highlight_min_mae(x):
   min_val = model_summary.loc[(slice(None), ["MAE"]), :].min()[x.name]
   return ['background-color: gold' if val == min_val else '' for val in x]

# Define a function to highlight minimum MAPE value
def highlight_min_mape(x):
   min_val = model_summary.loc[(slice(None), ["MAPE"]), :].min()[x.name]
   return ['background-color: yellow' if val == min_val else '' for val in x]

# Apply the function to the DataFrame
highlighted_df = model_summary.style.apply(highlight_min_mape)

# Display the highlighted DataFrame
highlighted_df  
Out[118]:
    KLM United Airline US GEO Region Other price category code International GEO Summary Domestic Low Fare
Linear Regression(BoxCox) RMSE 2685.737314 152895.433783 182783.021964 45061.787454 55470.076661 42233.340956
MAE 2542.531729 122828.409131 140841.017055 32284.756073 46089.507968 30036.874134
MAPE 0.329932 0.178330 0.097860 0.092683 0.103915 0.087194
ARIMA RMSE 1089.246865 61138.821906 123301.060220 32222.489131 36794.801265 32082.360631
MAE 936.700759 48985.628795 94641.184292 25387.216639 32860.195834 24273.419569
MAPE 0.119040 0.066857 0.061137 0.067356 0.076280 0.064415
Auto Regression RMSE 777.294130 34771.434079 47893.535137 26527.394827 22200.133503 26831.447808
MAE 638.595839 28288.749407 39416.883434 20430.226713 18759.625438 21018.442733
MAPE 0.079590 0.039734 0.026006 0.054614 0.042402 0.056434
SARIMAX RMSE 681.323248 18927.784672 36049.018413 32408.617739 8366.976434 10514.497452
MAE 527.306047 15910.830757 31936.121924 28725.828745 6840.929062 9355.119672
MAPE 0.073190 0.021543 0.020538 0.017461 0.015193 0.023767
Prophet MAE 511.826505 49676.393422 32746.651315 37290.179675 22127.973104 30407.522015
MAPE 0.065351 0.070902 0.021498 0.022758 0.046498 0.082329
In [119]:
model_summary_mape = model_summary.stack().reset_index()
model_summary_mape = model_summary_mape[model_summary_mape['level_1'] == 'MAPE']
# Changing columns name 
model_summary_mape.rename(columns={"level_2": "Dataset", "level_0": "Forecasting_Model", 0:"MAPE"}, inplace=True)
sns.barplot(x='MAPE', y='Dataset', hue= 'Forecasting_Model', data=model_summary_mape, orient='h', palette='viridis')
Out[119]:
<Axes: xlabel='MAPE', ylabel='Dataset'>

Observations:¶

  1. For KLM, Prophet and SARIMAX Models have the lowest MAEs and MAPEs. By numbers Prophet appears to be more Suitable.
  2. For Uniter Airline SARIMAX Model appears to be the Best Fit, considering the lowest MAPE and MAE among 5 Forecasting Models. Auto Regression has the next lowest MAPE.
  3. For the US GEO region Dataset, SARIMAX and Prophet look to have the lowest MAPEs and MAEs, but SARIMAX is more suitable.
  4. For Other Price Category Code, SARIMAX is clearly a more suitable model. Prophet has the next lowest MAPE.
  5. The SARIMAX Forecast has the lowest MAPE and MAE for the International GEO Summary category.
  6. SARIMAX is the best-fit model with the lowest MAPE and MAE for the Domestic Low Fare segment.

Overall SARIMAX model seems suitable for most of the sub-datasets we tested.

Further Opportunities:¶

To expand the study of this dataset, we can run the models against sub-datasets for other regions, and carriers and evaluate the model’s fit with a larger number of datasets.

Appendix:¶

Data Source: https://data.world/data-society/air-traffic-passenger-data
Source Code: https://github.com/naeemasma/cscie83/tree/main

In [ ]: